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OPTIMAL SELECTION OF REDUCED RANK ESTIMATORS OF 
HIGH-DIMENSIONAL MATRICES 

By Florentina Bunea*, Yiyuan She and Marten H. Wegkamp* 

Florida State University 
We introduce a new criterion, the Rank Selection Criterion (RSC), 
for selecting the optimal reduced rank estimator of the coefficient 
matrix in multivariate response regression models. The correspond- 
ing RSC estimator minimizes the Frobenius norm of the fit plus a 
regularization term proportional to the number of parameters in the 
reduced rank model. 

The rank of the RSC estimator provides a consistent estimator of 
' the rank of the coefficient matrix; in general the rank of our estimator 

^0 is a consistent estimate of the effective rank, which we define to be the 

number of singular values of the target matrix that are appropriately 
large. The consistency results are valid not only in the classic asymp- 
totic regime, when n, the number of responses, and p, the number of 
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predictors, stay bounded, and m, the number of observations, grows, 

> 

but also when either, or both, n and p grow, possibly much faster 
than m. 

C^l We establish minimax optimal bounds on the mean squared er- 

rors of our estimators. Our finite sample performance bounds for the 
RSC estimator show that it achieves the optimal balance between the 
• • approximation error and the penalty term. 

Furthermore, our procedure has very low computational complex- 
S-H ity, linear in the number of candidate models, making it particularly 

appealing for large scale problems. We contrast our estimator with 
the nuclear norm penalized least squares (NNP) estimator, which 

has an inherently higher computational complexity than RSC, for 
multivariate regression models. We show that NNP has estimation 

properties similar to those of RSC, albeit under stronger conditions. 
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However, it is not as parsimonious as RSC. We offer a simple correc- 



tion of the NNP estimator which leads to consistent rank estimation. 



We verify and illustrate our theoretical findings via an extensive 



simulation study. 



1. Introduction. In this paper we propose and analyze dimension reduction- type estimators 
for multivariate response regression models. Given m observations of the responses Y{ £ W 1 and 
predictors Xi 6 P, we assume that the matrices Y — \Yi-> • • • ■> ^m]' and X — [-^lj • • • 3 A" m ]' are 
related via an unknown p x n matrix of coefficients A, and write this as 



where E is a random m x n matrix, with independent entries with mean zero and variance a 2 . 

Standard least squares estimation in (1), under no constraints, is equivalent to regressing each 
response on the predictors separately. It completely ignores the multivariate nature of the possibly 
correlated responses, see, for instance, Izenman (2008) for a discussion of this phenomenon. Esti- 
mators restricted to have rank equal to a fixed number k < n A p were introduced to remedy this 
drawback. The history of such estimators dates back to the 1950's, and was initiated by Anderson 
(1951). Izenman (1975) introduced the term reduced-rank regression for this class of models and 
provided further study of the estimates. A number of important works followed, including Robin- 
son (1973, 1974) and Rao (1978). The monograph on reduced rank regression by Reinsel and Velu 

(1998) has an excellent, comprehensive account of more recent developments and extensions of the 
model. All theoretical results to date for estimators of A constrained to have rank equal to a given 
value k are of asymptotic nature and are obtained for fixed p, independent of the number of obser- 
vations to. Most of them are obtained in a likelihood framework, for Gaussian errors Ey. Anderson 

(1999) relaxed this assumption and derived the asymptotic distribution of the estimate, when p is 
fixed, the errors have two finite moments, and the rank of A is known. Anderson (2002) continued 
this work by constructing asymptotic tests for rank selection, valid only for small and fixed values 
of p. 

The aim of our work is to develop a non-asymptotic class of methods that yield reduced rank 
estimators of A that are easy to compute, have rank determined adaptively from the data, and are 
valid for any values of m, n and p, especially when the number of predictors p is large. The resulting 




Y = XA + E 
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estimators can then be used to construct a possibly much smaller number of new transformed 
predictors or can be used to construct the most important canonical variables based on the original 
X and Y. We refer to Chapter 6 in Izenman (2008) for a historical account of the latter. 

We propose to estimate A by minimizing the sum of squares \\Y— XB\\"p = (XB)ij} 2 
plus a penalty fir(B), proportional to the rank r(B), over all matrices B. It is immediate to see, 
using Pythagoras' theorem, that this is equivalent with computing min^ {H-P^ — + fir(B)} 

or min/% jming. r (B)=k \\PY — XB\\"p + /u/c}, with P being the projection matrix onto the column 
space of X. In Section 2.1 we show that the minimizer k of the above expression is the number of 
singular values dk(PY) of PY that exceed /i 1 / 2 . This observation reveals the prominent role of the 
tuning parameter ^ in constructing k. The final estimator A of the target matrix A is the minimizer 
of \\PY — XBWp over matrices B of rank k, and can be computed efficiently even for large p, using 
the procedure that we describe in detail in Section 2.1 below. 

The theoretical analysis of our proposed estimator A is presented in Sections 2.2 - 2.4. The rank 
of A may not be the most appropriate measure of sparsity in multivariate regression models. For 
instance, suppose that the rank of A is 100, but only three of its singular values are large and the 
remaining 97 are nearly zero. This is an extreme example, and in general one needs an objective 
method for declaring singular values as "large" or "small". We introduce in Section 2.1 a slightly 
different notion of sparsity, that of effective rank. The effective rank counts the number of singular 
values of the signal XA that are above a certain noise level. The relevant notion of noise level turns 
out to be the largest singular value of PE. This is central to our results, and influences the choice 
of the tuning sequence [x. In Appendix C we prove that the expected value of the largest singular 
value of PE is bounded by {q + n) 1 / 2 , where q < m Ap is the rank of X. The effective noise level 
is at most (m + ra) 1 / 2 , for instance in the model Y = A + E, but it can be substantially lower, of 
order (q + ra) 1 / 2 , in model (1). 

In Section 2.2 we give tight conditions under which k, the rank of our proposed estimator A, 
coincides with the effective rank. As an immediate corollary we show when k equals the rank of 
A. We give finite sample performance bounds for \\XA — XA\\^ in Section 2.3. These results show 
that A mimics the behavior of reduced rank estimates based on the ideal effective rank, had this 
been known prior to estimation. If X has a restricted isometrity property, our estimate is minimax 
adaptive. In the asymptotic setting, for n+(mAp) > n+q — > oo, all our results hold with probability 
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close to one, for tuning parameter chosen proportionally to the square of the noise level. 

We often particularize our main findings to the setting of Gaussian N(0, a 2 ) errors Eij in order to 
obtain sharp, explicit numerical constants for the penalty term. To avoid technicalities, we assume 
that a 2 is known in most cases, and we treat the case of unknown a 2 in Section 2.4. 

We contrast our estimator with the penalized least squares estimator A corresponding to a 
penalty term r[|.B[|i proportional to the nuclear norm [|J3[|i = Ylj dj(B), the sum of the singular 
values of B. This estimator has been studied by, among others, Yuan et al. (2007) and Lu et al 
(2010), for model (1). Nuclear norm penalized estimators in general models y = X(A) +e involving 
linear maps X have been studied by Candes and Plan (2010) and Negahban and Wainwright 
(2009). A special case of this model is the challenging matrix completion problem, first investigated 
theoretically, in the noiseless case, by Candes and Tao (2010). Rohde and Tsybakov (2010) studied 
a larger class of penalized estimators, that includes the nuclear norm estimator, in the general 
model y = X(A) + e. 

In Section 3 we give bounds on \\XA — XA\f F that are similar in spirit to those from Section 
2. While the error bounds of the two estimators are comparable, albeit with cleaner results and 
milder conditions for our proposed estimator, there is one aspect in which the estimates differ in 
important ways. The nuclear norm penalized estimator is far less parsimonious than the estimate 
obtained via our rank selection criterion. In Section 3, we offer a correction of the former estimate 
that yields a correct rank estimate. 

Section 4 complements our theoretical results by an extensive simulation study that supports 
our theoretical findings and suggests strongly that the proposed estimator behaves very well in 
practice, in most situations is preferable to the nuclear norm penalized estimator and it is always 
much faster to compute. 

Technical results and some intermediate proofs are presented in Appendices A - D. 

2. The Rank Selection Criterion. 

2.1. Methodology. We propose to estimate A by the penalized least squares estimator 
(2) l=argmin{||F-X J B||| + /ir(S)}. 

B 

We denote its rank by k. The minimization is taken over all p x n matrices B. Here and in what 
follows r(B) is the rank of B and \\C\\f = [YliYlj ^ij) denotes the Frobenius norm for any 
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generic matrix C. The choice of the tuning parameter \i > is discussed in Section 2.2. Since 

(3) min {\\Y - XB\\ 2 F + (ir(B)\ = mini min {\\Y - XB\\ 2 F + fik} 1 , 

B k [B, r(B)=k J 

one needs to compute the restricted rank estimators B k that minimize \\Y— XB\\ F over all matrices 
B of rank k. The following computationally efficient procedure for calculating each B k has been 
suggested by Reinsel and Velu (1998). Let M = X'X be the Gram matrix, M~ be its Moore-Penrose 
inverse and let P = XM~X' be the projection matrix onto the column space of X. 

1. Compute the eigenvectors V = [v i, V2, • • • ,v n ], corresponding to the ordered eigenvalues ar- 
ranged from largest to smallest, of the symmetric matrix Y'PY. 

2. Compute the least squares estimator B = M~X'Y. 
Construct W = BY and G = V . 

Form W k = W[,l-k] and G k = G[l : k, }. 

3. Compute the final estimator B k = W k G k . 

In step 2 above, W k denotes the matrix obtained from W by retaining all its rows and only its 
first k columns, and G k is obtained from G by retaining its first k rows and all its columns. 

Our first result, Proposition 1 below, characterizes the minimizer k = r(A) of (3) as the number 
of eigenvalues of the square matrix Y'PY that exceed \x or, equivalently, as the number of singular 
values of the matrix PY that exceed /i 1//2 . The final estimator of A is then A = Bt. 

Lemma 14 in Appendix B shows that the fitted matrix XA is equal to ^j^djUjVj based on 
the singular value decomposition UDV = ^ - djUjv'j of the projection PY. 

PROPOSITION 1. Let X^Y'PY) > X 2 (Y'PY) > ••• be the ordered eigenvalues of Y'PY. We 
have A = B^ with 

(4) k = max {k : X k {Y'PY) > n) . 

Proof. For B k given above, and by the Pythagorean theorem, we have 

\\Y - XB k f F = \\Y - PYf F + \\PY - XB k \\ 2 F , 
and we observe that XB = PY. By Lemma 14 in Appendix B, we have 

\\XB- XB k f F = Y J d 2 j {XB) = ^2d 2 (PY) = ^2\ j (Y'PY), 

j> k j>k j>k 
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where dj(C) denotes the j-th largest singular value of a matrix C. Then, the penalized least squares 
criterion reduces to 

\\Y - PYf F + < H Y ' PY ) + » k 

j>k 

and we find that ming {\\Y — XB\\ 2 F + /xr(£?)} equals 

\\Y - PY\\ 2 F - fj,n+ min^ {Xj(Y'PY) - fj,} . 

j>k 

It is easy to see that ^2j > f : {Xj(Y'PY) — fi} is minimized by taking k as the largest index j for 
which Xj(Y'PY) — \i > 0, since then the sum only consists of negative terms. This concludes our 
proof. □ 

Remark. The two matrices Wt and G?, that yield the final solution A = Wj-Gt, have the follow- 
ing properties: (i) G^G~ is the identity matrix; and (ii) W-^MW^ is a diagonal matrix. Moreover, 
the decomposition of A as a product of two matrices with properties (i) and (ii) is unique, see, for 
instance, Theorem 2.2 in Reinsel and Velu (1998). As an immediate consequence, one can construct 
new orthogonal predictors as the columns of Z = XWr. If k is much smaller than p, this can result 
in a significant dimension reduction of the predictors' space. 



2.2. Consistent effective rank estimation. In this section we study the properties of k = r(A). 
We will state simple conditions that guarantee that k equals r = r(A) with high probability. First, 
we describe in Theorem 2 what k estimates and what quantities need to be controlled for consistent 
estimation. It turns out that k estimates the number of the singular values of the signal XA above 
the threshold /U 1 / 2 , for any value of the tuning parameter /i. The quality of estimation is controlled 
by the probability that this threshold level exceeds the largest singular value d\ (PE) of the projected 
noise matrix PE. We denote the jth singular value of a generic matrix C by dj(C) and we use the 
convention that the singular values are indexed in decreasing order. 

Theorem 2. Suppose that there exists an index s < r such that 

d s (XA) > (1 + 8)yfji and d s+1 {XA) < (1 - 5)^Jl, 
for some 5 £ (0, 1]. Then we have 

P j£ = s } > l-P{di(PE) > • 
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Proof. Using the characterization of k given in Proposition 1 we have 

k>s <=> v77< d s+1 {PY) 
k<s ^>d s (PY). 

Therefore P jfc / s| = Pj^/J < d s+ i(Py) or y/Ji > d s (PY)} . Next, observe that PY = XA + 
PE and <fc(XA) < d k (PY) + d 1 (PE) for any fc. Hence d s (PY) < /W 2 implies di(P£) > d s (XA) - 
/i 1//2 , whereas d s +i(PY) > /i 1 / 2 implies that d\(PE) > n 1 / 2 — d s +\{XA). Consequently we have 

pjfc/sj <¥{di{PE) > nnn(y/Ji-d a+ i(XA),d a (XA) - y/JT)} . 

Invoke the conditions on d s +i(XA) and d s {XA) to complete the proof. □ 

Theorem 2 indicates that we can consistently estimate the index s provided we use a large enough 
value for our tuning parameter [i to guarantee that the probability of the event \d\(PE) < 5/x 1 / 2 } 
approaches one. We call s the effective rank of A relative to /i, and denote it by r e = r e (ff). 

This is the appropriate notion of sparsity in the multivariate regression problem: we can only 
hope to recover those singular values of the signal XA that are above the noise level E[di(PE)]. 
Their number, r e , will be the target rank of the approximation of the mean response, and can be 
much smaller than r = r(A). We regard the largest singular value d\{PE) as the relevant indicator 
of the strength of the noise. Standard results on the largest singular value of Gaussian matrices show 
that E[di(£?)] < a{vn}l 2 + n 1 / 2 ) and similar bounds are available for subGaussian matrices, see, for 
instance, Rudelson and Vershynin (2010). Interestingly, the expected value of the largest singular 
value di(PE) of the projected noise matrix is smaller: it is of order (q + n) 1 / 2 with q = r(X). If E 
has independent N(0,a 2 ) entries the following simple argument shows why this is the case. 

Lemma 3. Let q = r(X) and assume that Eij are independent N(0, a 2 ) random variables. Then 

E[d 1 (PE)]<o-(Vn-+^q-) 

and 

F{d!(PE) > E[dt(PE)] + at} < exp (-t 2 /2) 

for all t > 0. 
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Proof. Let UAU' be the eigen-decomposition of P. Since P is the projection matrix on the 
column space of X, only the first q entries of A on the diagonal equal to one, and all the remaining 
entries equal to zero. Then, d\(PE) = X^E'PE) = d\{KU'E). Since E has independent N(0,a 2 ) 
entries, the rotation U'E has the same distribution as E. Hence AU'E can be written as a q x n 
matrix with Gaussian entries on top of a (m—q)xn matrix of zeroes. Standard random matrix theory 
now states that K[di(AU' E)] < a(q l l 2 +n 1 / 2 ). The second claim of the lemma is a direct consequence 
of Borell's inequality, see, for instance, Van der Vaart and Wellner (1996), after recognizing that 
d\{AU'E) is the supremum of a Gaussian process. □ 

In view of this result, we take /i 1 / 2 > cj{n 1 / 2 + q 1 ^ 2 ) as our measure of the noise level. The 
following corollary summarizes the discussion above and lists the main results of this section: the 
proposed estimator based on the rank selection criterion (RSC) recovers consistently the effective 
rank r e and, in particular, the rank of A. 

Corollary 4. Assume that E has independent N(0,a 2 ) entries. For any 9 > 0, set 

V = (l + 6) 2 a 2 (^i + Vq) 2 /5 2 
with 5 as in Theorem 2. Then we have, for any 9 > 0, 

F{k / r e (^)} < exp ^-^9 2 (n + q)^j -> as q + n -> oo. 

In particular, if d r (XA) > 2^ 1 / 2 and /i 1//2 = (1 + 9)o~(y/n + ^fq), then 

W{k / r} < exp ^-^6> 2 (n + q)^j — > as q + n—t oo. 

Remark. Corollary 4 holds when q+n —> oo. If q+n stays bounded, but m — > oo, the consistency 
results continue to hold when q is replaced by gln(m) in the expression of the tuning parameter 
[i given above. Lemma 3 justifies this choice. The same remark applies to all theoretical results in 
this paper. 



Remark. A more involved argument is needed in order to establish the conclusion of Lemma 3 
when E has independent subGaussian entries. We give this argument in Proposition 15 presented 
in Appendix C. Proposition 15 shows, in particular, that when K[exp(tEij)] < exp(t 2 /r^;) for all 
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t > 0, and for some < oo, we have 

¥{d\{PE) > 32T E (q + n)(ln(5) + x)} < 2 exp{-x(q + n)}, 

for all x > 0. The conclusion of Corollary 4 then holds for n = Col^n + q) with Co large enough. 
Moreover, all oracle inequalities presented in the next sections remain valid for this choice of the 
tuning parameter, if E has independent subGaussian entries. 



2.3. Errors bounds for the RSC estimator. In this section we study the performance of A by 
obtaining bounds for \\XA — XA\\ 2 F . First we derive a bound for the fit — Xt4|||,, based on 

the restricted rank estimator for each value of k. 

Theorem 5. Set c(9) = 1 + 2/6. For any 9 > 0, we have 

\\XB k - XA\\ 2 F < \ c 2 {6) d 2 (XA) + 2(1 + e)c(6)kd 2 1 {PE) 
{ j>k 

with probability one. 

Proof. By the definition of B k , 

\\Y-XB k \\ 2 F < \\Y-XB\\ 2 F 
for all p x n matrices B of rank k. Working out the squares we obtain 

\\XB k -XA\\ 2 F < \\XB - XA\\ 2 F + 2 < E,XA- XB > F 
= \\XB - XA\\ 2 F + 2 < PE,XA- XB > F 

with 

<C,D> F = tr(C'D) = tr{D'C) = Y,Y1 C v D ij^ 

i 3 

for generic mx n matrices C and D. The inner product < C, D >f, operator norm ||C||2 = d\(C) 
and nuclear norm \\D\\i = Y2jdj(D) are related via the inequality < C,D >_p< [|C||2||-D||i- As a 
consequence we find 

<PE,XB k -XB> F < di{PE)\\XB k - XB\\ X 

< di{PE)V2k\\XB k - XB\\ F 

< d 1 (PE)V2k{\\XB k - XA\\ F + \\XB - XA\\ F }. 
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Using the inequality 2xy < x 2 /a + ay 2 with a > twice, we obtain that ||-X".Bfc — is bounded 

above by 

' i -^\\XB - XAf F + -\\XB k - XA\\l + (a + b)(2k)d\(PE). 
b a 

Hence we obtain, for any a, b > 0, the inequality 

\\XB k -XAf F < -^^S^^WXB - XAf F + 2{a + b)kd\{PE)\. 

Lemma 14 in the Appendix B states that the minimum of \\XA — XB\\ F over all matrices B of 
rank k is achieved for the GSVD of A and the minimum equals ^2j >k d 2 (XA). The claim follows 
after choosing a = (2 + 0)/2 and b = 6/2. □ 

Corollary 6. Assume that E has independent N(0,a 2 ) entries. Set c{6) = 1 + 2/6. Then, for 
any 6, £ > 0, the inequality 

\\XBj~ — XA\\ F 

< I c 2 (6)^d 2 (XA) + 2c(0)(l + 0)(1 + £) 2 a 2 k(n + t 
holds with probability 1 — exp(— £ 2 (n + q)/2). In addition, 



E 



\XB k - < Y, d ]( XA ) + ° 2 Hn + q). 



j>k 

The symbol < means that the inequality holds up to multiplicative numerical constants. 



PROOF. Set t = (1 + C) 2 a 2 {y/n + ^fq) 2 for some £ > 0. From Lemma 3, it follows that 

¥{d\{PE) >t}= ¥{dx{PE) > (1 + £Mv^ + y/q)} < exp(-f(n + q)/2). 

The first claim follows now from this bound and Theorem 5. From Lemma 16, it follows that 
E[df {PE)\ < v 2 + vy/2lr + 2 for v = E[di(P^7)] < <r(v^ + y^). This proves the second claim. □ 



Theorem 5 bounds the error — Xj4||p by an approximation error, 'Yl j >k d 2 (X A) , and a 

stochastic term, kd 2 (PE), with probability one. The approximation error is decreasing in k and 
vanishes for k > r(XA). 

The stochastic term increases in k and can be bounded by a constant times k(n + q) with 
overwhelming probability and in expectation, for Gaussian errors, by Corollary 6 above. More 
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generally, the same bound (up to constants) can be proved for subGaussian errors. Indeed, for Co 
large enough, Proposition 15 in Appendix C, states that F{d 2 (PE) < Co(n + q)} < 2 exp{— (n+q)}. 

We observe that k(n + q) is essentially the number of free parameters of the restricted rank 
problem. Indeed, our parameter space consists of all p x n matrices B of rank k and each XB 
matrix has k(n + q — k) free parameters. Hence we can interpret the bound in Corollary 6 above as 
the squared bias plus the dimension of the parameter space. 

Remark (ii), following Corollary 8 below, shows that k{n + q) is also the minimax lower bound 
for — X^4||^, if the smallest eigenvalue of X'X is larger than a strictly positive constant. 

This means that XB^ is a minimax estimator under this assumption. 

We now turn to the penalized estimator A and show that it achieves the best (squared) bias- 
variance trade-off among all rank restricted estimators B^ for the appropriate choice of the tuning 
parameter [i in the penalty pen(S) = fir(B). 

Theorem 7. We have, for any 9 > ; on the event (1 + 9)d\(PE) < /j, 
(5) \\XA-XA\\ 2 F < c 2 (e)\\XB-XA\\ 2 F + 2c{9)fik, 

for any p x n matrix B. In particular, we have, for //>(! + 9)d\{PE) 



(6) \\XA - XAf F < min < 

k 



c 2 (9)^2d 2 (XA) + 2c(9)nk 

j>k 



and 

(7) \\XA- XA\\ 2 F <2c(9)fir. 

Proof. By definition of A, 

\\Y - XA\\ 2 F + ixr{A) < \\Y - XB\\ 2 F + fir(B) 
for all p x n matrices B. Working out the squares we obtain 
\\XA-XA\\ 2 F 

< \\XB - XA\\ 2 F + 2fj,r(B) + 2 < E, XA - XB > F -fir(A) - fir(B) 
= \\XB - XA\\ 2 F + 2fir(B) + 2 < PE,XA-XB > F -fj,r(A) - fJ,r(B). 
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Next we observe that 

< PE,XA- XB > F 

< di{PE)\\XA- XB\\x 

< d 1 (PE){r(XA) + r{XB)} l / 2 \\XA- XB\\ F 

< dx(PE){r(A) + r{B)} 1/2 {\\XA - XA\\ F + \\XB - XA\\ F }. 

Consequently, using the inequality 2xy < x 2 /a + ay 2 twice, we obtain, for any a > and b > 0, 

\\XA-XA\\ 2 F < \\XB-XA\\ F + -\\XA-XA\\ F + -\\XB-XA\\ F + 

a ' b 

2nr{B) + (a + b){r(A) + r{B)}d\{PE) - fi{r(A) + r{B)}. 

Hence, if (a + b)d\{PE) — [i < 0, we obtain 

\\XA-XA\\ 2 F < -^{l±^\\XB-XA\\ 2 F + 2^r{B)\ , 
a — 1 I b J 

for any a > 1 and b > 0. Lemma 14 in Appendix B evaluates the minimum of ||X^4 — over all 

matrices B of rank k and shows that it equals ^j>fe d 2 (XA). We conclude our proof by choosing 

a = 1 + 6/2 and 6 = 0/2. □ 

Remark. The first two parts of the theorem show that ^4 achieves the best (squared) bias- 
variance trade-off among all reduced rank estimators B^ if fi > d\{PE). Moreover, the index 
k which minimizes ^2j > k{d 2 (XA) + fik} essentially coincides with the effective rank r e = r e (/i) 
defined in the previous section. Therefore, the fit of the selected estimator XA is comparable with 
that of the estimator XB^ with rank k = r e . Since the ideal r e depends on the unknown matrix A, 
this ideal estimator cannot be computed. Although our estimator A is constructed independently 
of r e , it mimics the behavior of the ideal estimator B Te and we say that the bound on ||Ayl — Aj4|||, 
adapts to r e < r. 

The last part of our result is a particular case of the second part, but it is perhaps easier to 
interpret. Taking the index k equal to the rank r, the bias term disappears and the bound reduces 
to rd 2 (PE) up to constants. This shows clearly the important role played by r in the estimation 
accuracy: the smaller the rank of A, the smaller the estimation error. 



For Gaussian errors, we have the following precise bounds. 
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Corollary 8. Assume that E has independent N(0,a 2 ) entries. Set 



pen{B) = (1 + 0)(1 + 2 (Vn~ + y/q) 2 a 2 r(B) 
with 9, £ > arbitrary. Let c{9) = 1 + 2/9. Then, we have 



and 



\XA-XA\\ 2 F < min <c 2 (9)y^d 2 (XA) + 2c{9)fik 

l<fc<min(n,p) | f— f J 

, , £ 2 (n+?) 

> 1 — exp ' 



E 



\XA-XA\\ 2 F 



< min 

l<fc<min(n,p) 



c 2 {9) d 2 {XA) + 2(1 - 0)c(0)(l + 2 ^ 2 (V" + V^) 2 * 
i>fc 

+4(1 + 0)c(0) min(n,p)cj 2 (l + f -1 ) exp(-£ 2 (n + q)). 



Proof. Recall from the proof of Theorem 7 that 

\\XA-XA\\ 2 F < 2 ^-!^^-\\XB-XA\\ 2 F + 2pen(B)+R 

with R defined by 

R = (l + 9){r(A) + r(B)}d 2 1 (PE) -pen{A) -pen{B) 

< 2(1+9) max k {d\(P E) - {1 + £,) 2 {^i + ^/q) 2 a 2 } . 

l</c<min(n,p) 

For E = E/a, a matrix of independent N(0, 1) entries, we have 

{d 2 (PE)-(i + o\Vn- + Vq) 2 } 



R < 2(1 + 0W 2 max k 



l<fc<min(n,p) 

,2 ( j2tr>-\ n , ^2, 



< 2min(n,p)(l + 9)a z (d((PE) - (1 +0 + Vq) 



+ 



Apply Lemma 16 in Appendix D to deduce that 



1 + £ 

E[R] < 4min(n,p) — -(1 + 9)a 2 e^(-f{n + q)/2). 



The conclusion follows immediately. 
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Remarks, (i) We note that for n + q large, 



E \\XA-XA\\ 2 F < 



min 

l<fc<min(n,p) 



d){XA) + <r 2 (v^ + y/qfk 



j>k 



as the remainder term in the bound of E \\XA — XA 



in Corollary 8 converges exponentially 



fast in n + q, to zero. 

(ii) Assuming that E has independent N(0, a 2 ) entries, the RSC estimator corresponding to 



X having a restricted isometry property (RIP), of the type introduced and discussed in Candes 
and Plan (2010) and Rohde and Tsybakov (2010). The RIP implies that \\XA\\% > p\\A\\ 2 F , for all 
matrices A of rank at most r and for some constant < p < 1. For fixed design matrices X, this is 
equivalent with assuming that the smallest eigenvalue X P (M) of the p x p Gram matrix M = X'X 
is larger than p. To establish the minimax lower bound for the mean squared error \\XA — XA\\ F , 
notice first that our model (1) can be rewritten as yi = trace(Z 4 'j4) + £j, with 1 < i < mn, via 
the mapping (a, b) — > i = a + (b — l)n, where 1 < a < m, 1 < b < n, yi =: Y a b G R and 
Zi =: X' a eb £ Mpxn- Here X a G M p denotes the a-th row of X, et> is the row vector in W 1 having the 
6-th component equal to 1 and the rest equal to zero, and M pxn is the space of all p x n matrices. 
Then, under RIP, the lower bound follows directly from Theorem 5 in Rohde and Tsybakov (2010); 
see also Theorem 2.5 in Candes and Plan (2010) for minimax lower bounds on \\A — A\\ 2 F . 

(iii) The same type of upper bound as the one of Corollary 8 can be proved if the entries of E 
are subGaussian: take pen(B) = C(n + q)r(B) for some C large enough, and invoke Proposition 15 
in Appendix C. 

(iv) Although the error bounds of 11X^4 — AA||i? are guaranteed for all X and A, the analysis 
of the estimation performance of A depends on X. If A P (M) > p > 0, for some constant p, then, 
provided p > (1 + 6)d\(PE) with 9 > arbitrary, 



follows from Theorem 7. 

(v) Our results are slightly more general than stated. In fact, our analysis does not require that 
the postulated multivariate linear model Y = XA + E holds exactly. We denote the expected value 



the penalty pen(l?) = C<r 2 (n 1 / 2 + q l l 2 ) 2 r(B), for any C > 1, is minimax adaptive, for matrices 




j>k 
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of Y by and write Y = Q + E. We denote the projection of onto the column space of X by 
XA, that is, P0 = XA. Because minimizing \\Y — XBWj? + fir(B) is equivalent with minimizing 
\\PY - XB\\ 2 F + /j.r(B) by Pytha goras' theorem, our least squares procedure estimates XA, the 
mean of PY. The statements of Theorems 2 and 7 remain unchanged, except that XA is the mean 
of the projection PY of Y, not the mean of Y itself. 



2.4. A data adaptive penalty term. In this section we construct a data adaptive penalty term 
that employs the unbiased estimator 

S 2 = \\Y - PY\\ 2 F /{mn- qn) 

of a 2 . Set, for any 9 > 0, £ > and < 5 < 1, 



pen(P) 



1-5 



(l + C) 2 (^+Vg) 2 5 2 r(B). 



Notice that the estimator 5 2 requires that n(m — g) be large, which holds whenever m » q or 
m — q > 1 and n is large. The challenging case m = q « p is left for future research. 

Theorem 9. Assume that E is an m x n matrix with independent N(0,a 2 ) entries. Using the 
penalty given above we have, for c(9) = 1 + 2/9, 



E 



\XA-XA\\ 2 F 



< min 

l<fc<min(n,p) 



c 2 {9) d](XA) + 2(1 + 9)c{9){\ + £) + ^) 2 A; 

£ 2 (n + c/) 



+4(1 + 0)c(0) min(n,p)a 2 (l + C l ) 

exp 

+4(l + 0)c(0)min(n,p)cj 2 (2 + (Vn + V</) 2 + (v 7 ™ + v^) ^ ) x 
5 2 n(m — q) 

x exp ' 



4(1 + 5) 

Proof. Set E = a~ 1 E. We have, for any p x n matrix P 



2 + . 



2 + 



|X J B-XA|| 2 7 + 2pen(P) 



^9 2 + n^m2 t /^ P ^ (l + g) 2 (v^ + Vg) 2 ^ 2 
+2— —(l + 9)a max fc^d^PP) — f 

9 l<fc<min(n,p) [ (1 - 8)<T 2 
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It remains to bound the expected value of 



S 2 



max k di(PE) - (1 + fl^Vn + ^ 

k<min(n,p) { (1 — 0)(T Z 

< min(n,p) (dj(PE) - (1 + £) 2 (v^ + \/g) 2 (1 ^2 
We split the expectation into two parts: S 2 > (1 — 5)a 2 and its complement. We observe first that 



E 



< E 



S 2 



d((PE) - (i + o 2 (V^ + Vq) {1 _ 6)(7 2 ) + hs-^isw} 

d 2 (PE)-(l + 0\Vn + VQ) 2 



< 2(1 + r 1 ) min(n,p) exp(-£(V^ + y/q)/2), 
using Lemma 16 for the last inequality. Next, we observe that 

E ' " ' 

< E 



d\{PE) - (i + e) 2 (v^ + Vq) {1 _ s)a2 ) + hs»<(i-s)**} 

dl(PE)l{S*<(l-S)a2}\ = E [ d l^ P ^) l {\\{I-P)E\\ 2 F <(l-&){nrn-nq)} 



Since PE and (I — P)2? are independent, and ||(J — P)E\\ 2 F has a Xnm-no distribution, we find 



E 
< E 



S 2 



d((PE) - (1 + 2 (V^ + VQ) (1 _ S)(j2 J l { ffl<(i-^) CT2} 
di(P^) p{||(I-P)£||| < (l-<5)(nm-ng)} 



< ((V» + ^qf + V2n(V^+^q) + 2) ex p{- 4( / +(5 ) n ( m - 9)} » 



using Lemmas 16 and 17 in Appendix D for the last inequality. This proves the result. 



Remark. We see that for large values of n + q and n{m — q), 



□ 



E 



\XA-XA\\p 



< 



mm 



l</c<min(n,p) 



j>k 



as the additional terms in the theorem above decrease exponentially fast in n + q and n(m — q). 
This bound is similar to the one in Corollary 8, obtained for the RSC estimator corresponding to 
the penalty term that employs the theoretical value of a 2 . 



HIGH DIMENSIONAL REDUCED RANK ESTIMATOR 17 

3. Comparison with nuclear norm penalized estimators. In this section we compare our 
RSC estimator A with the alternative estimator A that minimizes 

||y-XS||| + 2r||5||i 

over all p x n matrices B. 

Theorem 10. On the event di(X'E) < t, we have, for any B, 

\\XA-XA\\ 2 F < \\XB - XAf F + 4r||S||i. 
PROOF. By the definition of A, 

\\Y - XA\\ F + 2r||2||i < \\Y - XB\\ 2 F + 2r||S||i 
for all m x n matrices B. Working out the squares we obtain 

\\XA-XA\\ 2 F < \\XB-XA\\ F + 2r\\B\\i + 2<X'E,A-B> F -2r||A||i 

Since 

< X'E,A- B > F < \\X'E\\ 2 \\A- BHi < t\\A- B\\t 

on the event d\{X' E) < r, we obtain the claim using the triangle inequality. □ 

We see that A balances the bias term \\XA — A^£>||^ with the penalty term r||B||i, provided 
r > d x (X'E). Since X'E = X'PE + X'(I - P)E = X'PE, we have d x {X'E) < d 1 (X)d 1 (PE). We 
immediately obtain the following corollary using the results for d\{PE) of Lemma 3. 

Corollary 11. Assume that E has independent iV(0, a 2 ) entries. For 

T={l + 6)d 1 {X)o{Vn~+^q) 

with 6 > arbitrary, we have 

f{\\XA- XA\\ 2 F < ||XB-XA||! + 4t||.B||i} > 1 -exp|-^6» 2 (n + g)| . 



18 



F. BUNEA, Y. SHE, AND M.H. WEGKAMP 



The same result, up to constants, can be obtained if the errors Eij are subGaussian, if we replace 
a in the choice of r above by a suitably large constant C. The proof of this generalization uses 
Proposition 15 in Appendix C in lieu of Lemma 3. The same remark applies for all the results in 
this section. 

The next result obtains an oracle inequality for A that resembles the oracle inequality for the RSC 
estimator A in Theorem 7. We stress the fact that Theorem 12 below requires that X P {X'X) > 0; 
this was not required for the derivation of the oracle bound on \\XA — XA\\% in Theorem 7, which 



holds for all X. We denote the condition number of M = X'X by cq(M) = Xi(M)/X p (M). 
Theorem 12. Assume that E has independent iV(0, a 2 ) entries. For 

r=(l + 9)d 1 (X)a( y /n + ^q) 

with > arbitrary, we have 



the inequality holds up to multiplicative numerical constants (depending on 9). 

To keep the paper self contained, we give a simple proof of this result in Appendix A. Similar 
results for the NNP estimator of A in the general model y = X{A) + e, where X is a random linear 
map, have been obtained by Negahban and Wainwright (2009) and Candes and Plan (2010), each 
under different sets of assumptions on X. We refer to Rohde and Tsybakov (2010) for more general 
results on Schatten norm penalized estimators of A in the model y = X{A) +e, and a very thorough 
discussion on the assumptions on X under which these results hold. 

Theorem 10 shows that the error bounds of the nuclear norm penalized (NNP) estimator A and 
the RSC estimator A are comparable, although it is worth pointing out that our bounds for A are 




Furthermore, 
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much cleaner and obtained under fewer restrictions on the design matrix. However, there is one 
aspect in which the two estimators differ radically: correct rank recovery. We showed in Section 
2.2 that the RSC estimator corresponding to the effective value of the tuning sequence fi e has the 
correct rank and achieves the optimal bias- variance trade-off. This is also visible in the left panel of 
Figure 1 which shows the plots of the MSE and rank of the RSC estimate as we varied the tuning 
parameter of the procedure over a large grid. The numbers on the vertical axis correspond to the 
range of values of the rank of the estimator considered in this experiment, 1 to 25. The rank of 
A is 10. We notice that for the same range of values of the tuning parameter, RSC has both the 
smallest MSE value and the correct rank. We repeated this experiment for the NNP estimator. The 
right panel shows that the smallest MSE and the correct rank are not obtained for the same value 
of the tuning parameter. Therefore, a different strategy for correct rank estimation via NNP is in 
order. Rather than taking the rank of A as the estimator of the rank of A, we consider instead, for 



100 200 



400 500 600 700 
Tuning parameter 





Rank estimate T 

MSE 






\ 



50 100 150 200 250 300 350 400 450 500 
Tuning parameter 



Fig 1. The MSE and rank of the estimators RSC (left) and NNP (right) as a function of the tuning parameter. 
The rank estimate and MSE curves are plotted together for a better view of the effect of tuning on different 
estimation aspects. 



M = X'X, 

(8) k = max{£; : d h (MA) > 2t}. 

Theorem 13. Let r = r(A) and assume that d r (MA) > 4r. Then 

>r}< F{d!(X'E) > t}. 



If E has independent N(0,a 2 ) entries and r = (1 + 9)o-di(X){yJn + \/q)> the above probability is 
bounded by exp ( — 6 2 (n + q) /2) . 
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PROOF. After computing the sub-gradient of f(B) = \\Y — XB\\^p + 2t||B||i, we find that A is 
a minimizer of f{B) if and only if there exists a matrix J with d\{J) < 1 such that X'X{A — A) = 
X'E + tLIJV', where A = UDV' is the full SVD and U and V are orthonormal matrices. The 
matrix J is obtained from D by setting Ja = if Da = and Ju < 1 if Da > 0. Therefore, 

d x (MA - MA) < dx(X'E) + r. 

From Horn and Johnson (1985, page 419), 

1 4 (Mi) - d k (MA)\ < d\(MA - MA) < 2r 

for all k, on the event d\(X'E) < r. This means that d k (MA) > 2r for all k < r and d k {MA) < 2r 
for all k > r, since d r (MA) > 4r and d r +\(MA) = 0. The result now follows. □ 

4. Empirical Studies. 

4.1. RSC vs. NNP. We performed an extensive simulation study to evaluate the performance 
of the proposed method, RSC, and compare it with the NNP method. The RSC estimator A was 
computed via the procedure outlined in Section 2.1. This method is computationally efficient in 
large dimensions. Its computational complexity is the same as that of PCA. Our choice for the 
tuning parameter /x was based on our theoretical findings in Section 2. In particular, Corollary 4 
and Corollary 8 guarantee good rank selection and prediction performance of RSC provided that 
fi is just a little bit larger than a 2 (y/n + \/q) 2 - Under the assumption that q < m, we can estimate 
a 2 by S 2 ; see Section 2.4 for details. In our simulations we used the adaptive tuning parameter 
fJ-adap = 2S 2 (n + q). We experimented with other constants and found that the constant equal to 2 
was optimal; constants slightly larger than 2 gave very similar results. 

We compared the RSC estimator with the NNP estimator A and with the proposed trimmed or 
calibrated NNP estimator, denoted in what follows by NNP( C ). The NNP estimator is the minimizer 
of the convex criterion \\Y — XB\\p + 2r||5||i. By the equivalent SDP characterization of the 
NNP-norm given in Fazel (2002), the original minimization problem is equivalent to the convex 
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optimization problem 

(9) min \\Y - XB\\ 2 F +T(Tr(W 1 )+Tr(W 2 )) 

BeRp xn ,w 1 es n - 1 ,W2eSp- 1 



s.t. 



>- 0. 



Wi B T 
B W 2 

Therefore, the NNP estimator can be computed by adapting the general convex optimization algo- 
rithm SDPT3 (Toh et al. 1999) to (9). Alternatively, Bregman iterative algorithms can be developed; 
see Ma et al. (2009) for a detailed description of the main idea. Their code, however, is specifically 
designed for matrix completion and does not cover the multivariate regression problem. We imple- 
mented this algorithm for the simulation study presented below. The NNP( C ) is our calibration of 
the NNP estimator, based on Theorem 13. For a given value of the tuning parameter r we calculate 
the NNP estimator A and obtain the rank estimate r from (8). We then calculate the calibrated 
estimator as the reduced rank estimator Bf , with rank equal to r, following the procedure 
outlined in Section 2.1. 

In our simulation study we compared the rank selection and the estimation performances of the 
RSC estimator RSC| a d ap , corresponding to p a dap, with the optimally tuned RSC estimator, and the 
optimally tuned NNP and NNP( C ) estimators. The last three estimators are called RSC|„ a ;, NNP| wa / 
and NNP^|„„|. They correspond to those tuning parameters fi va i, r va i and T va i, respectively, that 
gave the best prediction accuracy, when prediction was evaluated on a very large independent 
validation set. This comparison helps us understand the true potential of each method in an ideal 
situation, and allows us to draw a stable performance comparison between the proposed adaptive 
RSC estimator and the best possible versions of RSC and NNP. 

We considered the following large sample-size set up and large dimensionality set up. 

Experiment 1 (m > p). We constructed the matrix of dependent variables X = [x±,X2, • • • ,x m ]' 
by generating its rows X{ as i.i.d. realizations from a multivariate normal distribution MVN(0, £), 
with Sjfc = /o' J '~ fe ', p > 0, 1 < j,k < p. The coefficient matrix A = bB^Bi, with b > 0, Bq is a 
p x r matrix and B\ is a r x n matrix. All entries in Bq and B\ are i.i.d. iV(0, 1). Each row in 
y = [yi, ■ ■ ■ , Um]' is then generated as yi = x^A + Ei, 1 < % < m, with E{ denoting the i-th row of 
the noise matrix E which has m x n independent iV(0, 1) entries Eij. 
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Experiment 2 (p > m(> q) ). The sample size in this experiment is relatively small. X is generated 
as X S 1 /2 ) w here = pV~ k \ £ RP x p, X = XiX 2 , X\ G R mxq , X 2 G R qxp and all entries of 
Xi,X2 are i.i.d. iV(0, 1). The coefficient matrix and the noise matrix are generated in the same way 
as in Experiment 1. Since p > m, this is a much more challenging setup than the one considered in 
Experiment 1. Note however that q, the rank of X, is required to be strictly less than m. 

Each simulated model is characterized by the following control parameters: m (sample size), 
p (number of independent variables), n (number of response variables), r (rank of A), p (design 
correlation), q (rank of the design), and b (signal strength). In Experiment 1, we set m = 100, p = 
25, n = 25, r = 10, and varied the correlation coefficient p = 0.1,0.5,0.9 and signal strength 
b = 0.1,0.2,0.3,0.4. All combinations of correlation and signal strength are covered in the simula- 
tions. The results are summarized in Table 1. In Experiment 2, we set m = 20, p = 100, n = 25, 
q = 10, r = 5, and varied the correlation p = 0.1, 0.5, 0.9 and signal strength b = 0.1, 0.2, 0.3. The 
corresponding results are reported in Table 2. In both tables, MSE(yl) and MSE(X^4) denote the 
40% trimmed- means of 100 • \\A — B\\%/{pn) and 100 • \\XA — XBWj?/(mn), respectively. We also 
report the median rank estimates (RE) and the successful rank recovery percentages (RRP). 

Summary of simulation results. 

(i) We found that the RSC estimator corresponding to the adaptive choice of the tuning param- 
eter Padap = 2S 2 (n + q) has excellent performance. It behaves as well as the RSC estimator that 
uses the parameter p tuned on the large validation set or the RSC estimator corresponding to the 
theoretical p = 2a 2 (n + q). 

(ii) When the signal-to-noise ratio SNR := d r {X A) / {^q + -y/n) is moderate or high, with values 
approximately 1, 1.5 and 2, corresponding to b = 0.2,0.3,0.4, and for low to moderate correlation 
between the predictors (p = 0.1,0.5), RSC has excellent behavior in terms of rank selection and 
means squared errors. Interestingly, NNP does not have optimal behavior in this set-up: its mean 
squared errors are slightly higher than those of the RSC estimator. When the noise is very large 
relative to the signal strength, corresponding to b = 0.1 in Table 1, or when the correlation between 
some covariates is very high, p = 0.9 in Table 1, NNP may be slightly more accurate than the RSC. 

(iii) The NNP does not recover the correct rank, when its regularization parameter is tuned by 
validation. Both Tables 1 and 2 show that the correct rank r (r = 10 in Experiment 1 and r = 5 
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Table 1 

Performance comparisons of Experiment 1, in terms of mean squared errors (MSE(XA), MSE(A)), 
median rank estimate (RE), and rank recovery percentage (RRP). 











RSC| adap 




RSC|„ al 


NNP|„ a ; 


NNP (c) |„ al 


b = 0.1 






0.9 


MSE(XA), MSE(A) 


16.6. 5.3 




16.3, 5.2 


11.5, 3.0 


16.5, 5.3 




RE, RRP 


6, 0% 




6, 0%, 


12, 0%, 


6, 0%, 






0.5 


MKE(XA). MSE(A) 


18.7, 1.4 




18.1, 1.4 


16.2, 1.1 


18.1, 1.4 




RE, RRP 


8, 0% 




9, 40% 


16.5, 0% 


9, 35% 


P 




0.1 


MSE(XA), MSE(A) 


19.3, 1.0 




18.0, 0.9 


16.9, 0.8 


18.0, 0.9 




RE. RRP 


9, 0% 




10, 75% 


17, 0% 


10, 65% 












b 


= 0.2 






P 




0.9 


MSE(XA), MSE(A) 


18.4, 7.0 




17.9, 7.1 


15.9, 5.4 


17.9, 7.1 






RE, RRP 


8, 0% 




9, 20% 


16, 0% 


9, 15% 


P 




0.5 


MKE(XA). MSE(A) 


16.7, 1.3 




16.7, 1.3 


18.9, 1.5 


16.7, 1.3 




RE, RRP 


10, 100% 




10, 100%, 


19, 0% 


10, 100%, 


P 




0.1 


USE(XA). MSE(A) 


16.5, 0.9 




16.5, 0.9 


19.2, 1.0 


16.5, 0.9 




RE, RRP 


10, 100% 




10, 100%, 


18, 0% 


10, 100%, 












b 


= 0.3 






P 




0.9 


MSE(XA), MSE(A) 


17.4, 7.0 




17.3, 6.9 


17.7, 6.7 


17.3, 7.0 




RE, RRP 


10, 65% 




10, 95% 


18, 0% 


10, 80% 


P 




0.5 


MSE(XA) . MSE(A) 


16.4, 1.3 




16.4, 1.3 


19.8, 1.6 


16.4, 1.3 




RE, RRP 


10, 100%, 




10, 100%, 


19, 0% 


10, 100 % 


P 




0.1 


MSE(XA). MSE(A) 


16.4, 0.9 




16.4, 0.9 


19.9, 1.1 


16.4, 0.9 




BE. RRP 


10, 100%, 




10, 100%, 


19, 0% 


10, 100%, 












b 


= 0.4 






P 




0.9 


MSE(XA), MSE(A) 


16.8, 6.6 




16.8, 6.7 


18.7, 7.4 


16.8, 6.8 




RE, RRP 


10, 100%, 




10, 100%, 


18, 0% 


10, 85% 


P 




0.5 


MSE(XA) . MSE(A) 


16.3, 1.3 




16.3, 1.3 


20.3, 1.7 


16.3, 1.3 




RE, RRP 


10, 100%) 




10, 100%, 


20, 0%, 


10, 100%, 


P 




0.1 


MSE(XA). MSE(A) 


16.3, 0.9 




16.3, 0.9 


20.3, 1.1 


16.3, 0.9 




RE, RRP 


10, 100%, 




10, 100% 


20, 0%, 


10, 100%, 



Table 2 

Performance comparisons of Experiment 2, in terms of mean squared errors (MSE(XA), MSE(A), 
median rank estimate (RE), and rank recovery percentage (RRP). 









RSC| atiap 


RSC|„ a , 


NNP|„ ai 


NNP (c, L al 


b = 0.1 




= 0.9 


MSE(XA), MSE(A) 


29.4, 3.9 


29.4, 3.9 


36.4, 3.9 


29.4, 3.9 


P 




RE, RRP 


5, 100%, 


5, 100%, 


10, 0%, 


5, 100%, 


P 


= 0.5 


MSE(XA), MSE(A) 


29.1, 3.9 


29.1, 3.9 


37.2, 3.9 


29.1, 3.9 


RE, RRP 


5, 100% 


5, 100% 


10, 0% 


5, 100%, 


P 


= 0.1 


MSE(XA), MSE(A) 


29.0, 3.9 


29.0, 3.9 


37.2, 4.0 


29.0, 3.9 


RE, RRP 


5, 100%o 


5, 100%, 


10, 0%, 


5, 100%, 


b = 0.2 


P 


= 0.9 


MSE(XA), MSE(A) 


28.9, 15.7 


28.9, 15.7 


38.7, 15.7 


28.9, 15.7 


RE, RRP 


5, 100%, 


5, 100% 


10, 0% 


5, 100%, 


P 


= 0.5 


MSE(XA), MSE(A) 


28.6, 15.7 


28.6, 15.7 


39.0, 15.7 


28.6, 15.7 


RE, RRP 


5, 100% 


5, 100% 


10, 0% 


5, 100% 


P 


= 0.1 


MSE(XA), MSE(A) 


28.7, 15.8 


28.7, 15.8 


38.7. 15.8 


28.7, 15.8 


RE, RRP 


5, 100% 


5, 100% 


10, 0% 


5, 100% 


b = 0.3 


P 


= 0.9 


MSE(XA), MSE(A) 


28.8, 35.3 


28.8, 35.3 


39.2, 35.3 


28.8, 35.3 


RE, RRP 


5, 100% 


5, 100%, 


10, 0%, 


5, 100%, 


P 


= 0.5 


MSE(XA), MSE(A) 


28.5, 35.4 


28.5, 35.4 


39.5, 35.4 


28.5, 35.4 


RE, RRP 


5, 100% 


5, 100% 


10, 0% 


5. 100 % 


P 


= 0.1 


MSE(XA), MSE(A) 


28.6, 35.5 


28.6, 35.5 


39.3, 35.5 


28.6, 35.5 


RE, RRP 


5, 100% 


5, 100% 


10, 0% 


5, 100%, 



24 F. BUNEA, Y. SHE, AND M.H. WEGKAMP 

in Experiment 2) is overestimated by NNP. Our trimmed estimator, NNP( C ), provides a successful 
improvement over NNP in this respect. This supports Theorem 13. 

In additional simulations, we found that especially for low or moderate SNRs, the NNP pa- 
rameter tuning problem is much more challenging than the RSC parameter tuning. NNP cannot 
accurately estimate A and consistently select the rank at the same time, for the same value of 
the tuning parameter. This echoes the findings presented in Figure 1, and is to be expected: in 
NNP regularization, the threshold value r also controls the amount of shrinkage, which should be 
mild for large samples with relatively low contamination. This is the case for moderate SNR and 
moderate correlation between predictors: the tuned r tends to be too small, so it cannot introduce 
enough sparsity. The same continues to be true for slightly larger values of r that compensate for 
high noise level and very high correlation between predictors. In summary, one may not be able to 
build an accurate and parsimonious model via the NNP method, without further adjustments. 

Overall, RSC is recommended over the NNP estimators, especially when we suspect that the 
SNR is not very low. With large validation tuning, NNP( C ) has the same properties as RSC - they 
coincide when both methods select the same rank. But in general, the rank estimation via 
is much more difficult to tune and much more computationally involved than RSC. 

For data with low SNR, an immediate extension of the RSC estimator that involves a second 
penalty term, of ridge-type, may induce the right amount of shrinkage needed to offset the noise in 
the data. This conjecture will be investigated carefully in future research. 

4.2. Tightness of the rank consistency results. It can be shown, using arguments similar to those 
used in the proof of Theorem 2, that 

pjfe/r} > Pi = ¥{y/Ji< d 2r +i(PE) or d x (PE) < ^Jl-d r (XA)}. 

On the other hand, the proof of Theorem 2 reveals that 

pj£/r} <P 2 = F{d 1 (PE) > mm(^Jl,dr{XA) - ^Jl)} . 

Suppose now that 2fj}/ 2 < d r (XA) and that r is small. Then Pi equals F{d2 r +i(PE) > y/ji} and is 
close to i-*2 = F{d%(PE) > ,yji} for a sparse model. Of course, if // is much larger than d%(XA), then 
P 2 cannot be small. We use this observation to argue that, if the goal is consistent rank estimation, 
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then we can deviate only very little from the requirement 2/W 2 < d r (XA). This strongly suggests 
that the sufficient condition given in Corollary 4 for consistent rank selection is tight. We empirically 
verified this conjecture for signal-to- noise ratios larger than 1 by comparing p\ = d 2 {XA) with /jl u , 
the ideal upper bound of that interval of values of \x that give the correct rank. The value of fj, u was 
obtained in the simulation experiments by searching along solution paths obtained as follows. We 
constructed 100 different pairs (X, A) following the simulation design outlined in the subsection 
above. Each pair was obtained by varying the signal strength b, correlation p, the rank of A and 
m,n,p. For each run we computed the solution path, as in Figure 1 of the previous section. From 
the solution path we recorded the upper value of the fi interval for which the correct rank was 
recovered. We plotted the resulting (/j,i,/j, u ) pairs in Figure 2 and we conclude that the theoretical 
bound on jjl in Corollary 4 is tight. 



The condition for correct rank selection is TIGHT 
1000 1 1 1 1 1 1 1 1 , 
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Fig 2. Tightness of the consistency condition. 



APPENDIX A: PROOF OF THEOREM 12 
The starting point is the inequality 

\\XA-XA\\ 2 F < \\XB-XA\\ 2 f + 2t{\\A-B\\ 1 + - ||2||i} 

that holds on the event di(X'E) < r. The inequality can be deduced from the proof of Theorem 
10. Then, by Lemmas 3.4 and 2.3 in Recht et al (2007) there exist two matrices A\ and A2 such 
that 
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(i) A = A x + 2 2 

(ii) r(Ii) < 2r(B) 

(iii) ||A-S||i = || Al -B||i + \\A 2 \\i 

(iv) ||I- B|||, = Hlx - B\\ 2 F + ||I 2 ||| > ||Ai - B\\ 2 F 

(v) ||!||i = ||2i||i + ||2 2 ||i. 



Using the display above, we find 



\\XA-XA\\ 2 F 

< \\XB - XAf F + 2r {||Ii - B||i + ||! 2 ||i + ||B||i - Pi||i - ||A 2 ||i} 

by (i), (iii) and (v) 

< \\XB - XA\\ 2 F + 4r||2i - B||i 

|2 



< \\XB - XA\\ F + 4ry r(Ai - B) \\A X - B\\p by Cauchy-Schwarz 

< \\XB - XA\\ F + 4r v / 3r(B) || A - B|| F by (ii) and (iv). 

Using A P (M)||A-B||| < ||XA-XB|||, and 2xy < x 2 /2 + 2y 2 , we obtain 

^\\XA-XA\\ 2 p < ^\\XB - XA\\ 2 F + 24r 2 r(B)/X p (M). 

The proof is complete by choosing the truncated GSVD B' under metric M, see Lemma 14 below. 

□ 

APPENDIX B: GENERALIZED SINGULAR VALUE DECOMPOSITION 
We consider the functional 

G(B) = \\XB - XB\\ 2 F = tr((B - B )'M(B - B )) 

with M = X'X = NN and Bo is a fixed p x n matrix of rank r. By the Eckhart- Young theorem, 
we have the lower bound 

G(B)>^2d 2 (XB ) 

j>k 

for all p x n matrices B of rank k. We now show that this infimum is achieved by the generalized 
singular value decomposition (GSVD) under metric M, limited to its k largest generalized singular 
values. Following Takane and Hunter (2001, pages 399-400), the GSVD of Bq under metric M is 
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UDV' where U is an p x r matrix, U'MU = I r , V is an n x r matrix, V'V = I r and D is a diagonal 
r x r matrix, and NBq = NUDV' . It can be computed via the (regular) SVD UDV' of NBq. From 
B' X'XBq = VD 2 V' , the generalized singular values c/j are the regular singular values of NBq. Let 
-Bfe = U k D k V k by retaining as usual the first k columns of U and V. 

Lemma 14. Let B^ be the GSVD of Bq under metric M , restricted to the k largest generalized 
singular values. We have 

\\XBq — XBkW'p = ^2 d 2 j(XBo). 

j>k 

PROOF. Since NB = NUDV' and NB k = NU k D k V£, we obtain 

A = NB - NB k = N Y, = NU {k) D {k) V( k) 

j>k 

using the notation Un^ for the p x (r — k) matrix consisting of the last r — k column vectors of U, 
is the diagonal (r — k) x (r — k) matrix based on the last r — k singular values, and V^ for 
the n x (r — k) matrix consisting of the last r — k column vectors of V. Finally, 

\\XBq - XB k \\ 2 F = ||A|||i = \\NU( k )D( k )V{ k j\\ 2 F 

= tr(v {k) D {k) U[ k) MU {k) D {k) Vl k) ) 

= tr(v {k) D {k) I {k) D {k) V( k) ) = tr(z)f fc) ) = Y d l 

j>k 

Recall that in the construction of the GSVD, the generalized singular values dj are the singular 
values of NBq . Since 

d){NBo) = \ 3 {B'^MBq) = X 3 (B' X'XB ) = d 2 (XB ), 
the claim follows. □ 

Remark. The rank restricted estimator B k given in Section 2.1 is the GSVD of the least squares 
estimator B under the metric M = X'X, see Takane and Hwang (2007). 

APPENDIX C: LARGEST SINGULAR VALUES OF TRANSFORMATIONS OF 

SUBGAUSSIAN MATRICES 

We call a random variable W subGaussian with subGaussian moment Fyy, if 



E [exp(tW)] < exp(t 2 /r 



w 
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for all t > 0. Markov's inequality implies that W has Gaussian type tails: 

^{\W\ >t}< 2exp{-t 2 /(2IV)} 

holds for any t > 0. Normal N(0, a 2 ) random variables are subGaussian with Fyy = a 2 . General 
results on the largest singular values of matrices E with subGaussian entries can be found in the 
survey paper by Rudelson and Vershynin (2010). The analysis of our estimators require bounds 
for the largest singular values of PE and X'E, for which the standard results on E do not apply 
directly. 

Proposition 15. Let E be a m x n matrix with independent subGaussian entries Eij with 
subGaussian moment Ye- Let X be an m x p matrix of rank q and let P = X(X'X)~X' be the 
projection matrix on R[X\. Then, for each x > 0, 

f{d\(PE) > 32r s ((n + g)ln(5) + x)} < 2exp(-x). 

In particular, 

E[d!(PE)} < 151b V^Tg- 

PROOF. Let S 1 ™ -1 be the unit sphere in l n . First we note that 

IIP-EII2 = sup <Pu,Ev>= sup <u,Ev> 

with U = PS'f- 1 = {u = Ps: s G S^ 1 }. Let M be a 5-net of U and N be a 5-net for S n ~ l with 
5 = 1/2. Since the dimension of U is q and||u|| < 1 for each u G U, we need at most 5 9 elements in 
Ai to cover U and 5 n elements to cover S"" -1 , see Kolmogorov and Tikhomirov (1961). A standard 
discretization trick, see, for instance, Rudelson and Vershynin (2010, proof of Proposition 2.4), gives 

H-P-EII2 < 4 max <u,Ev>. 
ueM, veM 

Next, we write < u, Ev >= Y^T= 1 n * < Ei,v > and note that each < Ei,v > is subGaussian with 
moment r^, as 

n 

E [exp(t < E u v >)} = Y[ E [expitvjEij)} < exp(t 2 ^ v 2 /T E ) = exp{t 2 /T E ). 
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It follows that each term in YlT=i u-i < Ei,v > is subGaussian, and < u, Ev > is subGaussian with 
subGaussian moment YaL i u i = ^E- This implies the tail bound 

P{| <u,Ev>\>t}< 2exp{-t 2 /(2T E )} 

for each fixed u and v and all t > 0. Combining the previous two steps, we obtain 

F{\\PE\\ 2 > At} < 5 n+q 2exp{-t 2 /{2T E )} 

for all t > 0. Taking t 2 = 2{ln(5)(n + q) + x}T E we obtain the first claim. The second claim follows 
from this tail bound. □ 

APPENDIX D: AUXILIARY RESULTS 

Lemma 16. Let X be a non-negative random variable with K[X] = fi and F{X — \x > t} < 
exp(—t 2 /2) for all t > 0. Then we have 

E [X 2 ] < /i 2 + ny/27r + 2. 

Moreover, for any £ > 0, we have 

E [(X 2 - (1 + V) + ] < 2(1 + r 1 ) exp(-^V 2 /2). 
Proof. The following string of inequalities are self-evident: 

f'OO f'OO 

E [X 2 ] = / F{X 2 >x}dx<fi 2 + 2xF{X > x} dx 
Jo J n 

< fi 2 + J 2(x + fi) exp (^-^x 2 ^ dx = fi 2 + fiV2n + 2. 

This proves our first claim. The second claim is easily deduced as follows: 

r i r°° 

E (X 2 -(l+f)V) < E[X 2 l {x > {1+m ] = / 2tF{X>t}dt 

/•oo 

< (1 + r 1 ) / 2texp(-t 2 /2)dt 
= 2(l + r 1 )exp(-^V/2). 

The proof of the lemma is complete. □ 
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Lemma 17. Let Z d be a x\ random variable with d degrees of freedom. Then 

F{z d -d< -xV2d\ < exp [ X . ) 

\ ~ J ~ V 2 + 2x^27^ J 

In particular, for any < t < 1, 

F{Z d < (1 - t)d} < exp{-t 2 d/4(l + t)} . 

Proof. See Cavalier et al (2002, page 857) for the first claim. The second claim follows by taking 
x = t{d/2) 1 ' 2 . □ 
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